Figure 1. Example living room background activity dataset captured using our tools and methodology: (a) front HD video; (b) rear HD video; (c) Kinect 
facing chairs; (d) Kinect facing couch. All data is time stamped for synchronization. Kinect streams include colour, depth, skeleton, and spatial audio. 
Vicon motion capture of head positions (note tracking hats) was included in 7 sessions. 


ABSTRACT 

In real settings, natural body movements can be erroneously 
recognized by whole-body input systems as explicit input 
actions. We call body activity not intended as input actions 
background activity. We argue that understanding background 
activity is crucial to the success of always-available 
whole-body input in the real world. To operationalize this 
argument, we contribute a reusable study methodology and 
software tools to generate standardized background activity 
datasets composed of data from multiple Kinect cameras, a 
Vicon tracker, and two high-definition video cameras. Using 
our methodology, we create an example background activity 
dataset for a television-oriented living room setting. We use 
this dataset to demonstrate how it can be used to redesign a 
gestural interaction vocabulary to minimize conflicts with the 
real world. The software tools and initial living room dataset 
are publicly available^1. 
INTRODUCTION 

Classifying whether body motions were intended as input is 
more than just a technological challenge: doing it incorrectly 
can potentially be deadly. For example, it was recently 
discovered that people could unintentionally disable their Nest 
Protect smoke alarm when normal arm movements were 
erroneously interpreted as a wave-to-silence gesture^2. 
Misrecognising background activity as an explicit input action is an 
example of the Midas touch problem [15]. Midas touch problems 
are likely to increase as more always-available whole-body 
input systems are deployed in real environments such 
as public places [25], classrooms [5], meeting rooms [1], and 
kitchens [28]. 

^1 http://www.dgp.toronto.edu/~dustin/backgroundactivity/ 

We call naturally occurring activity not intended for input 
commands background activity. Since body tracking and gesture 
recognition are not yet robust in real environments, the 
potential for Midas touch problems is compounded. Some types 
of unexpected or unusual background activity can foul tracking 
and recognition systems, creating more opportunities for 
misrecognized input. Avoiding erroneous input is critical to 
adoption and usability: people cannot be expected to carefully 
constrain their natural motions to avoid misclassification, so the 
problem must be tackled directly. We argue that capturing 
background activity for observation and design testing is crucial 
to improving always-available whole-body input. 

In this paper, we contribute a reusable methodology and supporting 
software tools to generate standardized background 
activity datasets with 3D motion tracking, depth cameras, 
spatial audio, and high-definition video (Figure 1). Our data 
gathering protocol requires participants to perform explicit 
prompted gestures at regular intervals, so that datasets contain 
controlled foreground activity. To validate our methodology, 
we captured a dataset with 52 person-hours of background 
activity in a television-oriented living room setting, which we 
make available to the community. 


^2 http://www.bbc.com/news/technology-26879987 
We ran a proven gesture recognizer for the prompted gestures 
through our dataset and found a very large number of 
false positives. This reflects the motivation for our study and 
dataset: current whole-body interaction design and gesture 
detection do not consider background activity. As an 
application of background activity datasets, we design a set of 
proposed gestures that correspond semantically to our original 
prompted gesture set. When tested on our dataset, these 
yield substantially fewer false positives. We include additional 
observations about body postures. 

RELATED WORK 

Large datasets of naturally occurring body movements are 
useful for conducting post hoc observational inquiries, modelling 
phenomena, motivating technique designs, training algorithms, 
testing individual techniques, and comparing multiple 
techniques with a common baseline. Examples of well-established 
datasets include the MNIST handwritten digit 
database [20] for handwriting recognition, the MacKenzie 
Phrase Set [23] to evaluate text entry techniques, and datasets 
of static objects captured by depth cameras [16, 18] for computer 
graphics algorithms. Dataset corpora have a strong tradition 
in natural language processing and have been leveraged 
to make speech input classification robust to background 
speech [4, 10]. In the field of gesture recognition, algorithms 
are trained and tested using datasets similar to Marcel's [24] 
compilation of hand gesture and posture images, and to the 
Cambridge Gesture Database [17] of image sequences showing 
various hand motions. More recently, the ChaLearn gesture 
challenge dataset was established as part of a competition 
in ICMI 2012 to recognize gestures consisting of motion and 
hand shapes in 320x240 Kinect RGB-D data [12]. 

Datasets of whole-body motion exist, but these focus primarily 
on short sequences of high-energy actions performed by 
actors in a motion capture studio [5, 22, 27, 29]. More recently, 
the CMU Quality of Life Technology Centre created 
a multimodal capture database of people cooking in a simulated 
kitchen [7]. With an average of 5 minutes per clip, the 
sequences are too short and too task-focused to provide general 
background activity. 

In contrast to pre-existing datasets, we capture much longer 
sequences with minimally invasive equipment and we encourage 
a high degree of social interaction and comfort. 
Rather than clean, segmented sequences of distinct actions, 
we capture realistic, noisy, everyday actions. Unlike previous 
datasets, we also intersperse explicit input sequences for 
baseline testing with natural background activity. 

Explicit Input with Gestures 

Using body gestures for explicit input has been extensively 
studied [1, 3, 13, 19, 29, 34, 36, 40]. With always-available 
body input, the difference between gesture and non-gesture 
can be subtle, introducing false positives [22, 28]. Baudel and 
Beaudouin-Lafon [2] describe systems that treat every gesture 
of the user as potentially meaningful as suffering from immersion 
syndrome: such systems ignore that interacting with the system 
is not the user's only ongoing activity. 


Detecting gestures in a continuous stream of input is known 
as the Gesture Spotting Problem. A common approach is to 
model each gesture type as a Hidden Markov Model (HMM) 
and detect gestures when their likelihood exceeds that of 
a thresholding HMM, synthesized from the trained gesture 
HMMs [21]. The limitation of this approach is that this 
thresholding HMM does not model the background. 

An alternate approach is to design a gesture delimiter that 
rarely occurs naturally. For pen input, Grossman et al. [11] 
logged naturally occurring pen hover motions to design distinct 
hover gestures. For device motion gestures, Ruiz and Li 
[30] gathered naturally occurring motion data to design and 
test the distinct DoubleFlip motion gesture delimiter. These 
projects demonstrate the use of background activity data, but 
neither offered a generalizable methodology to capture and 
distribute the data. To our knowledge, there is no dataset that 
can be used to evaluate gestural interfaces in the context of 
naturally occurring whole-body background activity. 

BACKGROUND ACTIVITY 

Background activity is interleaved with all interface input, but 
some input techniques explicitly differentiate between input 
and non-input actions using an explicit control signal. As a 
simple example, consider that hand movement is only used 
for cursor control when a mouse is manipulated — all other 
movements away from the mouse are easily ignored. 

When whole-body input systems constantly track the movements 
of body parts, they can become confounded by the ambiguity 
between background activity and explicit control. The 
reason is that control signals can often be very similar to typical 
background activity movements [28]. An outstretched 
arm with a pointed index finger could be a gesture to select a 
location on a computer display (foreground activity) or a deictic 
gesture to support human communication (background 
activity). The problem is compounded in active environments 
where multiple people are multi-tasking with others, or where 
the physical environment is not conducive to careful, explicit 
movements. 

In computer vision, background subtraction is a common 
method to separate objects of interest using a model of the 
image background [35]. The separation of foreground objects 
(explicit input) is achieved by a deep understanding of 
the background scene (i.e., background activity). We argue 
that whole-body input research can use an analogous approach. 
Current gesture and motion training datasets [6, 33] 
are not suitable; a corpus of background activity datasets in 
realistic environments is needed, as are a methodology and 
tools to enable collection of additional datasets. 

Approaches to Managing Background Activity 

There are multiple gesture detection approaches to distinguish 
foreground activity from noisy background activity. We 
present the most common approaches: 

Explicit Clutch: The system only responds when in a specific 
user-determined state. For example, Saponas et al. [31] 
use a clenched left fist to enter a gesture recognition state for 
the right hand. 

Delimiter: A special gesture indicates that an input sequence 
is about to begin. Sometimes the delimiter is multimodal [5], 
but when using pure whole-body input, a unique 
gesture can be used [39]. There are varying ways to indicate 
the end of the interaction sequence, such as ending after the 
first gesture recognized [30], or after a period of inactivity. 
Hudson et al. [14] use the term framing gesture when a delimiter 
is performed before and after the interaction sequence. 

Implicit Clutch: The system examines the gesturing context 
to determine when a hand motion should be interpreted as 
input. Baudel and Beaudouin-Lafon [2] and Fourney [8] use 
a spatial active zone: when the hand is in a certain area, the 
movements are recognized. Other features can be used, such 
as body pose and gaze [32], to determine when hand gestures 
are intended as input. 

Always-On: The system constantly tracks and responds to 
any motions that are intended for input. This requires motions 
to be clearly distinguished from background activity. 
The challenge is to find unique, distinct gestures. This is the 
approach taken by the Nest Protect smoke alarm. 
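
As an illustration of the spatial active zone form of implicit clutch, the following is a minimal sketch, not the implementation used by any of the cited systems. The zone bounds, the body-relative coordinate frame, and the names `ActiveZone` and `gate_by_zone` are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActiveZone:
    """Axis-aligned box in body-relative coordinates (metres).

    The default bounds are hypothetical placeholders, not values from the paper.
    """
    x: tuple = (-0.4, 0.4)   # left/right of the torso
    y: tuple = (0.0, 0.6)    # above the hip
    z: tuple = (0.3, 0.8)    # in front of the torso

    def contains(self, hand):
        bounds = (self.x, self.y, self.z)
        return all(lo <= v <= hi for v, (lo, hi) in zip(hand, bounds))


def gate_by_zone(hand_positions, zone):
    """Keep only the hand samples inside the active zone.

    Samples outside the zone are treated as background activity and are
    never passed on to a gesture recognizer.
    """
    return [p for p in hand_positions if zone.contains(p)]


trajectory = [(0.0, 0.1, 0.1), (0.0, 0.3, 0.5), (0.9, 0.3, 0.5)]
candidates = gate_by_zone(trajectory, ActiveZone())
```

Only the second sample lies inside the box, so only it would reach the recognizer. A fixed zone like this is simple to reason about, which is also why it is unlikely to transfer to settings where people sit in varied postures.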

There are benefits and issues with these approaches. An explicit 
clutch is clear, but requires user attention to be maintained. 
Delimiters do not have to be maintained during the 
interaction, but require extra time at the beginning. Assuming 
a robust clutch or delimiter can be found, it can still feel 
awkward and require extra cognitive effort to use. Another 
problem presented by a strict delimiter approach is that it precludes 
detecting movements that could be used as implicit 
input, like sensing emotional affect [1] or level of attention 
[37]. 

Always-on and implicit clutch are the most natural modes of 
interaction. To be usable and reliable, they require the deepest 
understanding of background activity. The active zone 
implicit clutch works well in a presentation context, but is 
unlikely to generalize. The hand wave used by the always-on 
Nest Protect was not unique enough. 

Our approach is to study background activity to move closer 
to the goals of implicit clutches and always-on input. In the 
next section we describe a principled way to capture background 
activity datasets. We then demonstrate the usefulness 
of our initial dataset to find unique and robust gestures that 
do not frequently occur in background activity. This approach 
could be combined with others: for example, these unique gestures 
could also be used as a delimiter, and the dataset could 
be used to design and test implicit clutches. 

Establishing a Methodology for Dataset Building 

Our objective is to establish a repeatable methodology for 
capturing an ecologically valid recording of whole-body 
background activity in a form suitable for distribution. In 
this section we establish a study protocol that includes occasional 
prompted foreground activity segments for baseline 
comparison, provide format specifications for a public domain 
dataset, and describe our logging and analysis software. 
We use our methodology to capture background activity 
in a television-oriented living room, a plausible context 
for whole-body interaction. 



Figure 2. Living room environment with seating and large screen television: 
(a) small display for prompted foreground activity gestures; (b) 
Kinect cameras; (c) HD cameras. 


Eliciting Background Activity 

Unlike typical methodologies where people are instructed 
to perform specific motions, asking people to act out background 
activity would not produce realistic results. We therefore 
advocate creating a physical and social environment that 
allows background activity to emerge naturally. For our sample 
dataset, we created a laboratory living room setting with 
a game console and television (Figure 2). To increase social 
interaction, we only recruited participants who had existing 
social relationships with each other. To encourage object 
manipulation background activity, we provided snacks and 
drinks. 

To gain full benefit from the dataset, the inclusion of typical 
foreground gesture activity is essential to serve as a comparative 
baseline. We achieve this by occasionally prompting 
participants in a subset of groups to perform one of four common 
gestures. Our methodology may be extended to test 
a particular gesture language by adding those gestures to the 
experimental protocol. This can even be done before recognizers 
have been built for those gestures, and the captured data 
can then inform the recognizer implementation. 

CAPTURE PROTOCOL 
Physical Environment Setup 

We simulated a 4 m by 4 m living room with comfortable 
furniture and used soft incandescent lighting and curtains to 
hide the institutional walls (Figure 2). We placed two armchairs 
and a two-person sofa in front of a 54” television with 
external speakers approximately 2 m away. Participants could 
watch Netflix programs or play video games, controlled using 
a single wired Xbox controller. We intentionally provided 
a single controller to increase background activity: controller 
usage had to be socially negotiated and transferred. Similarly, 
background activity was encouraged when selecting a video 
game from a stack on the floor and inserting it into the Xbox. 

To maintain an unobstructed view of the participants, we 
placed a small coffee table between the couch and the nearest 
armchair, rather than in front. This table held food and other 
personal belongings within arm's reach of the two nearest participants. 
This was another intentional choice to encourage 
background activity, since items on the table needed to be 
passed to the two outer participants. 




Capturing Apparatus 

We used minimally invasive capture equipment to capture 
each study group of 4 participants. A wide-angle HD video 
camera captured audio and video of the entire scene from the 
front (Figure 1a) and a second HD camera captured from behind, 
including the gesture prompt screen and television content 
(Figure 1b). 

One Kinect faced the sofa (Figure 1d), and the other faced 
the armchairs (Figure 1c). Each Kinect recorded 13-bit, 640 
by 480 px depth with 3 bits of player id masks (pixels classified 
as part of a human body), 640 by 480 px RGB video, 
20-segment skeleton tracking (when possible), and spatially separated 
sound using Microsoft Kinect SDK version 1.5. When 
used, a six-camera Vicon system placed high in the ceiling 
tracked head position and orientation of all four participants 
using four lightweight hats. We were concerned that the Vicon 
tracking hats would affect behaviour, so we used them 
with only a subset of groups in order to increase the breadth 
of our sample dataset. 

We found that the built-in Kinect SDK recorder produced extremely 
large files (typically 1.5 GB/min per Kinect). To keep 
the data manageable, we designed a more efficient capture 
format (typically 0.3 GB/min). We used RIFF as a generic 
container to house all time-indexed depth, RGB, and skeleton 
frames in one file. RGB frames were compressed with 
lossy JPEG compression and depth frames with lossless LZF 
compression. Since the Kinect SDK does not output depth, 
RGB, and skeletal frames at a consistent rate, each frame is 
time stamped. We provide Windows C# software to capture 
and play back Kinect data in this format, as well as Python 
software for gesture detection and other analyses. We plan to 
update the file format and tools for the Microsoft Kinect 2. A 
detailed file format description is included with the dataset to 
enable other implementations. 
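
To make the idea of a RIFF container of time-stamped frames concrete, here is a minimal Python sketch. It follows the generic RIFF chunk layout (four-character id, little-endian 32-bit size, payload, pad byte for odd sizes), but everything specific is an assumption for illustration: the form type `KBGA`, the chunk ids `DPTH` and `SKEL`, and the 8-byte millisecond timestamp prefix are hypothetical and are not taken from the dataset's actual format description. Because LZF is not in the Python standard library, `zlib` stands in for the lossless depth compression.

```python
import struct
import zlib


def pack_chunk(fourcc: bytes, payload: bytes) -> bytes:
    """One RIFF chunk: id, little-endian size, data, pad byte if size is odd."""
    assert len(fourcc) == 4
    pad = b"\x00" if len(payload) % 2 else b""
    return fourcc + struct.pack("<I", len(payload)) + payload + pad


def pack_frame(kind: bytes, timestamp_ms: int, data: bytes) -> bytes:
    """Hypothetical frame layout: 8-byte timestamp followed by the frame bytes."""
    return pack_chunk(kind, struct.pack("<Q", timestamp_ms) + data)


def write_riff(form_type: bytes, chunks) -> bytes:
    body = form_type + b"".join(chunks)
    return b"RIFF" + struct.pack("<I", len(body)) + body


def read_frames(blob: bytes):
    """Yield (chunk id, timestamp_ms, frame bytes) for every chunk in the file."""
    assert blob[:4] == b"RIFF"
    (size,) = struct.unpack_from("<I", blob, 4)
    pos, end = 12, 8 + size          # skip "RIFF", size, and form type
    while pos < end:
        fourcc = blob[pos:pos + 4]
        (n,) = struct.unpack_from("<I", blob, pos + 4)
        payload = blob[pos + 8:pos + 8 + n]
        (ts,) = struct.unpack_from("<Q", payload, 0)
        yield fourcc, ts, payload[8:]
        pos += 8 + n + (n % 2)       # honour the pad byte


# A blank 640x480 16-bit depth frame, compressed losslessly (zlib as LZF stand-in).
raw_depth = bytes(640 * 480 * 2)
blob = write_riff(b"KBGA", [
    pack_frame(b"DPTH", 1000, zlib.compress(raw_depth)),
    pack_frame(b"SKEL", 1033, b"abc"),
])
frames = list(read_frames(blob))
```

Storing a timestamp inside every chunk, rather than assuming a fixed frame rate, is what lets streams that arrive at irregular rates be re-synchronized during playback.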

Public Dataset Concerns 

We were careful to gain approval from our research ethics 
board so that we could make the dataset publicly available. 
Participants were warned of this in advance of arriving at the 
study, and were given 1 week after the study to contact the 
researcher if they had concerns. To ensure the dataset is 
rich and useful in analyzing background activity, full audio is 
included, and faces will not be blurred. While the details of 
the dataset are publicly available, its full download is only 
possible after a Terms of Use is agreed to, identifying the 
dataset user as a researcher. 

Participants 

A large amount of background activity is socially motivated, 
so we recruited participants in groups instead of individuals. 
Online posting and word-of-mouth yielded 13 groups of four 
participants, for a total of 52 participants. The mean age was 
26 years (ranging from 19 to 59). Overall, 67% of our participants 
were male, but gender distribution within groups varied: 
one all-female, four all-male, and the remaining mixed. 
Seven groups used Vicon motion tracking, seven groups included 
prompted foreground gestures, and five groups had 
both. 


In three groups, one participant was meeting the others for the 
first time, but all others had existing social relationships. Pairs 
of participants with closer relationships would often rush to 
the sofa and were often physically affectionate. In one group, 
one participant was frustrated with the other members and 
avoided social interaction; he spent most of his time reading 
a newspaper. 

Procedure 

The procedure emphasizes putting participants into a mood 
suitable for the simulated environment. In the case of our 
living room simulation, this meant getting participants comfortable 
and minimizing the feeling of being in a lab. The 
researcher always met participants outside the building and 
guided them to the study room on a route planned to minimize 
time in office spaces. During the walk, the group was engaged 
in small talk to help everyone relax. We wanted participants 
to act as if at home (shouting, cheering, joking) without worrying 
about disturbing others working in the building. Study 
times also reflected this social situation, with most group captures 
occurring in the evenings or on weekends. 

To increase background activity, food and drinks were placed 
on the coffee table in the study environment, along with disposable 
plates, cups, and napkins, and a garbage can. Participants 
were told to help themselves to the snacks. 

In instances where prompted gestures were collected, the researcher 
gave instructions on performing them (details below). 
He then provided instructions on the use of the Xbox 
console. Participants were encouraged to relax and enjoy 
whatever they wished on the television, or to just talk, as long 
as they remained in the simulated living room space and in 
the same order on the furniture. The study ran for 60 minutes. 
During this time, the researcher remained out-of-sight 
in a nearby location monitoring the capture streams in case 
there were any problems, and then gave a five-minute warning 
before the study ended. 

Prompted Foreground Gestures 

To capture the difference between background activity and 
intentional gestures, we selected a set of gestures to be 
prompted during the session. We chose four common gestures: 
Horizontal Swipe, whole-hand AirTap [38], Wave [37], 
and Point [8] (Figure 3). Horizontal Swipe is a left or right 
motion (~60 cm) with the palm perpendicular to the large display, 
arm extended away from the body, and elbow relaxed. 
AirTap is a forward and back movement (~25 cm) with palm 
facing the large display. Wave is a left and right periodic motion 
(~25 cm) with the elbow roughly fixed in space. Point 
extends the arm and index finger towards the television. The 
required duration of Wave and Point was 800 ms. These gestures 
were chosen since they have been used for explicit input, 
with demonstrated successful detection, but we believed 
they were also likely to occur in background activity. We 
kept the set of gestures small to reduce cognitive load on our 
participants and avoid interference with our primary goal of 
observing background activity. 
Seven groups (out of 13) were regularly prompted to perform 
gestures to capture foreground activity in the context of background 
activity. A 17-inch display below the television (Figure 2a) 
prompted people to perform one of the prompted 
gestures using an iconic representation and audio cue. We 
prompted participants by number (1-4), and each performed 
each gesture at least once. The prompt was displayed until the 
gesture was recognized by the researcher monitoring the camera 
feed (i.e., a Wizard-of-Oz recognition technique). 

The only feedback provided was that the prompt disappeared 
when the gesture was correctly detected. It would be difficult to 
provide high-fidelity feedback when our elicitation procedure 
is Wizard-of-Oz, and it would have prevented us from accepting 
complex interleavings of foreground and background activity. 
While feedback could possibly improve our true positive 
detection rate, the primary goal of this study is to collect 
background activity, which would not be affected by any sort 
of feedback. 

Before the study began, the researcher demonstrated each 
gesture to the group twice. The researcher left the room so 
that each participant could practice following the small display 
prompt to perform one gesture. All gesture-training 
demonstrations are included in the dataset. Each gesture was 
prompted five times during the 60-minute session, resulting 
in a foreground gesture sequence approximately every three 
minutes. 
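
The prompt timing follows directly from the protocol: 4 gestures × 5 prompts = 20 prompts in 60 minutes, or one every 3 minutes. The sketch below generates one such schedule. The random ordering, the rule for choosing which participant receives the fifth prompt of each gesture, and the name `build_schedule` are assumptions for illustration; the text only states that each participant performed each gesture at least once.

```python
import random

GESTURES = ["Swipe", "AirTap", "Wave", "Point"]


def build_schedule(session_min=60, prompts_per_gesture=5,
                   participants=(1, 2, 3, 4), seed=0):
    """Return a list of (time_s, gesture, participant) prompts.

    Every participant gets each gesture once; any remaining prompts for a
    gesture go to randomly chosen participants (an assumption, see above).
    """
    rng = random.Random(seed)
    prompts = []
    for gesture in GESTURES:
        who = list(participants)
        while len(who) < prompts_per_gesture:
            who.append(rng.choice(participants))
        prompts += [(gesture, p) for p in who[:prompts_per_gesture]]
    rng.shuffle(prompts)
    interval_s = session_min * 60 / len(prompts)   # 3600 / 20 = 180 s
    return [(round((i + 0.5) * interval_s), g, p)
            for i, (g, p) in enumerate(prompts)]


schedule = build_schedule()
```

Centering each prompt in its 180-second slot keeps prompts away from the very start and end of the session.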

RESULTS 

We captured 1 hour of data per group of 4 people, totalling 
52 person-hours of background activity and 750 GB of data. 

Participant Behaviour 

Most groups played a game or watched television while also 
talking, eating, and using mobile devices. While the television 
display was the primary focus, participants were almost 
always multi-tasking. Participants assumed a wide variety of 
comfortable positions on the furniture that suggest we were 
successful at simulating a realistic living room setting. 

Intensity of background activity varied. Aggressive gesticulation 
was common, especially for boisterous groups. One 
group of hip-hop dancers was very expressive with a high 
level of dynamic movement. Another group had two of its 
members subconsciously compete to be the center of attention, 
outdoing each other in speaking volume and gesticulation 
intensity. There were also quieter groups, such as a 
married couple and one set of parents. This group quietly 
watched a movie and ate snacks, speaking occasionally. 

Prompted Gestures 

For groups with prompted gestures, we captured a total of 140 
gesture sequences (7 groups × 4 gestures × 5 prompts). We 
noticed that well-intentioned participants reminded others to 
perform a gesture. This usually involved some communicative 
gesture similar to the required gesture. Because 
this appeared to be an artefact of our setting, we asked 
participants not to engage in this behaviour. 


Capture Quality 

The Kinect captured data at between 15 and 30 fps. For groups 
with Vicon motion tracking, 6 DOF data for each hat was captured 
at between 60 and 120 fps. At first the tracking hats seemed 
conspicuous to some of the participants, but they relaxed after 
10 minutes or so. Hat tracking data is included in the full 
data set, despite not being analyzed in this paper. 

RECOGNITION OF PROMPTED GESTURES 

Background activity datasets can be used to test different gestures 
and recognizers. As an example, we use our initial 
dataset to evaluate the performance of an HMM Gesture Spotting 
Network (GSN) with the four prompted gestures: Swipe, 
Point, Wave, and AirTap. These results are dependent on 
skeletal tracking quality for hand position, a realistic limitation 
when using current skeletal trackers, especially in environments 
like a living room, where relaxed postures might 
cause poor skeleton tracking. 

HMM GSN Design, Implementation, and Training 

Our design is based on Fourney [8] and Lee and Kim [21]. 
A GSN is a meta-Hidden-Markov-Model (HMM) containing 
multiple HMMs connected in parallel. There are left-to-right 
gesture HMMs for each variation of the gesture to be detected 
and a special threshold HMM representing non-gesture 
movements. A gesture is detected (or spotted) when the final 
state of one of the gesture HMMs has a higher likelihood than 
every state in the threshold HMM. Our left-to-right gesture 
HMMs were constructed of 4 states each. 
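
The spotting rule described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the GSN used in the study: the models below have two states rather than four, the observation alphabet has only two symbols, all probabilities are invented, and the threshold model is hand-written rather than synthesized from trained gesture HMMs. The function evaluates a single observation window; a real spotter would slide over the continuous stream.

```python
import math

NEG_INF = float("-inf")


def log(x):
    return math.log(x) if x > 0 else NEG_INF


def viterbi_final_scores(obs, start, trans, emit):
    """Best-path log score ending in each state after consuming obs."""
    n = len(start)
    score = [log(start[s]) + log(emit[s][obs[0]]) for s in range(n)]
    for o in obs[1:]:
        score = [
            max(score[r] + log(trans[r][s]) for r in range(n)) + log(emit[s][o])
            for s in range(n)
        ]
    return score


def spot(obs, gesture_models, threshold_model):
    """Return the gesture whose final state beats every threshold state, else None."""
    best_name = None
    best = max(viterbi_final_scores(obs, *threshold_model))
    for name, model in gesture_models.items():
        final = viterbi_final_scores(obs, *model)[-1]   # last state of left-to-right HMM
        if final > best:
            best_name, best = name, final
    return best_name


# Hypothetical toy models: (start, transition, emission) over symbols {0, 1}.
swipe = ([1.0, 0.0], [[0.5, 0.5], [0.0, 1.0]], [[0.9, 0.1], [0.1, 0.9]])
threshold = ([0.5, 0.5], [[0.5, 0.5], [0.5, 0.5]], [[0.6, 0.4], [0.4, 0.6]])

hit = spot([0, 0, 1, 1], {"Swipe": swipe}, threshold)
miss = spot([1, 0, 1, 0], {"Swipe": swipe}, threshold)
```

The ordered sequence `[0, 0, 1, 1]` matches the left-to-right structure and is spotted, while the alternating sequence is better explained by the flat threshold model and is rejected.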

Like Fourney, we discretize body-relative hand position and 
velocity into features, although ours are in 3D. We designed 
features by plotting the training gestures and determining how 
best to distinguish between them. We measure the depth of 
the hand relative to the shoulder and its horizontal and vertical 
position relative to the elbow. We take the 3D position and 
assign four discrete features: one binary thresholded radius, 
as well as three features for angles between the hand vector 
and depth sensor-relative axes. The angle features each have 
three possible values, of the form {θ < −π/4, −π/4 < θ < 
π/4, or π/4 < θ}. We found spherical coordinates were a 
good model for the 3D hand position relative to the body, as 
suggested by Freeman et al. [9]. The discretized velocity 
feature is the nearest 3D unit vector of the form [{−1, 0, +1}, 
{−1, 0, +1}, {−1, 0, +1}]. 
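
The discretization can be sketched as follows. Several details are assumptions because the text does not fix them: the radius threshold (0.35 m), the minimum speed below which velocity is treated as zero, the use of a single body-relative offset rather than separate shoulder and elbow references, and the definition of each signed angle as the elevation of the hand vector towards an axis, `asin(component / radius)`, which yields values in [−π/2, π/2] so that the three bins are all reachable.

```python
import itertools
import math


def angle_bin(theta):
    """Map an angle to one of three bins split at -pi/4 and +pi/4."""
    if theta < -math.pi / 4:
        return 0
    if theta > math.pi / 4:
        return 2
    return 1


def position_features(hand, radius_threshold=0.35):
    """hand: (x, y, z) body-relative offset in metres -> 4 discrete features."""
    r = math.sqrt(sum(c * c for c in hand))
    if r == 0:
        return (0, 1, 1, 1)
    # Clamp guards against tiny floating-point overshoot before asin.
    angles = [math.asin(max(-1.0, min(1.0, c / r))) for c in hand]
    return (int(r > radius_threshold),) + tuple(angle_bin(a) for a in angles)


_DIRECTIONS = [d for d in itertools.product((-1, 0, 1), repeat=3) if any(d)]


def velocity_feature(v, min_speed=0.05):
    """Snap a velocity to the closest direction with components in {-1, 0, +1}."""
    speed = math.sqrt(sum(c * c for c in v))
    if speed < min_speed:
        return (0, 0, 0)

    def cosine(d):
        norm = math.sqrt(sum(c * c for c in d))
        return sum(a * b for a, b in zip(v, d)) / (norm * speed)

    return max(_DIRECTIONS, key=cosine)
```

For example, a hand held 0.5 m straight out along the depth axis maps to `(1, 1, 1, 2)`, and a velocity moving equally right and up maps to `(1, 1, 0)`. Coarse bins like these keep the HMM observation alphabet small, at the cost of making nearby motions indistinguishable.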

We found that participants performed Swipe and AirTap with 
a few variations, so a single HMM could not describe these 
gestures. Instead, we trained one gesture HMM for each variant. 
Swipe has two variants: elbow straight and elbow bent. 
AirTap motions all began with a quick forward motion, but 
finished with one of three variants: relaxing the arm, dropping 
the arm, or pulling the arm back quickly. 

For training data, six volunteers, who did not participate in 
the study, performed each gesture variant 3 to 15 times while 
seated. After discarding approximately 20% of cases with 
poor tracking or unusual motions, there were 80 training examples 
per gesture variant. We trained the gesture HMMs 
using the Baum-Welch algorithm, with 10% of the training 
examples as held-out test data. 
Figure 3. Diagrammatic representations of the original prompted gestures used in the capture study, followed by the corresponding proposed gestures 
developed using the dataset. They are semantically similar but with substantially fewer false positives. 


We found 99% accuracy on the test data by comparing likelihood 
of individual gesture HMMs. When we constructed 
a GSN by adding a thresholding background model, accuracy 
reduced to 83%. The worst performing gesture was 
Point (79%), which was often incorrectly recognized as background 
(13%). As the Point gesture is simply the user holding 
still in the same state, this is difficult to distinguish from 
the same behaviour in the background model by the standard 
method of background model construction. 

Results 

We first evaluate the true-positive rate using the 140 prompted 
gesture sequences. Given noisy data, with participants sitting 
in relaxed postures often quite close to each other and 
blocked by props such as food items or controllers, the performance 
of the Microsoft Kinect SDK (v1.5) skeleton tracker 
was affected. We manually examined the skeletal data on 
each prompted gesture sequence and found only 44% of these 
to have decent skeletal tracking. For these, the true positive 
detection rate for prompted gestures was 48%. 

While the detection rate appears low, we consider this fairly 
good, given the diversity of gesture performance, examples 
of which are in the video accompanying this paper. The participants 
did not train against a recognizer, and the researcher 
was intentionally liberal with their definition of a correct gesture 
in the Wizard-of-Oz procedure. Over the duration of the 
study, gesture performance became subtler and more individualized. 
We frequently saw participants repeating a gesture multiple 
times in quick succession. Swipe and AirTap in particular 
changed substantially over time. We think it is important 
to not just study more realistic gestural background activity, 
but more realistic performance of gestures when participants 
are tired, or even bored. Indeed, Negulescu et al. have proposed 
using a second, lower threshold for recognition when 
two barely-recognizable gestures are performed immediately 
after one another [26]. 

To evaluate false-positive rates, we ran the GSN over each 
tracked skeleton in all background activity sequences. We 
found 73,729 false positives: 38,005 for Swipe, 15,716 for 
Wave, 19,120 for Point, and 888 for AirTap. In total, this is 
one false positive every 5.1 seconds per participant. We 
examined 20 false positives for each gesture and found many 
cases where poor skeleton tracking was the cause. The results 
indicate that our prompted gestures are abundant in background 
activity, which results in a high false-positive recognition 
rate even with a reasonable true-positive detection rate. 
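
The per-gesture breakdown can be checked with a few lines of Python. The counts are the ones reported above; the helper names are hypothetical, and `seconds_per_false_positive` takes the total tracked-skeleton time as an input because that figure is not stated in this section.

```python
FALSE_POSITIVES = {"Swipe": 38005, "Wave": 15716, "Point": 19120, "AirTap": 888}


def summarize(counts):
    """Return the total and each gesture's share of all false positives."""
    total = sum(counts.values())
    shares = {gesture: n / total for gesture, n in counts.items()}
    return total, shares


def seconds_per_false_positive(total_false_positives, tracked_seconds):
    """Average gap between false positives for a given amount of tracked time."""
    return tracked_seconds / total_false_positives


total, shares = summarize(FALSE_POSITIVES)
```

The four counts sum to the reported 73,729, with Swipe alone accounting for just over half of all false positives and AirTap for roughly one percent.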

Focusing on false positives with good skeletal tracking, we 
identified five common causes: reaching or manipulating objects, 
gesticulating, touching, repositioning, and stretching. 
Reaching or manipulating an object created motions similar 
to a point or swipe. Gesticulation led to expressive hand 
movements that could look like any of the gestures. When 
participants touched themselves, such as scratching, a wave 
gesture was often recognized. When participants repositioned 
their body, such as leaning back and extending their arms 
forward on the armrest, this appeared as a forward-extended 
point gesture. Finally, stretching, often with both arms, triggered 
an AirTap or forward point gesture. In the next section 
we discuss design implications based on these causes to 
reduce these false positives. This is only an initial examination 
of false-positive causes; the dataset provides the means 
to complete a more formal analysis. 

As we noted earlier, none of these actions are avoidable in the
real world. Regardless of how successful the recognizer is 
in identifying these gestures, they will always be susceptible 
to misrecognitions. What we need are gestures that are still 
reasonable to perform but also unique, in the sense that they 
do not frequently appear in the background activity. 

PROPOSING NEW GESTURES 

The prompted gestures we naively chose produced far too 
many false positives to be useful in a real scenario. While 
recognition may be improved by continually researching a 
better recognizer, this will provide diminishing returns. We 
demonstrate the utility of background activity datasets by us¬ 
ing our living room dataset to redesign our gesture set to be 
more robust to the real-world activity, without any changes to 
the design of our gesture recognizer. 

To test the utility of a given gesture in a certain background
activity context, we can simply train a detector to recognize 
the gesture, then run it through our data and count the number 
of false positives, where fewer false positives is better. This is 
an extension of previous procedures used in different sensing 
domains [30]. 
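
The counting procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not our GSN implementation: the `Detector` interface (a callable that returns spotted gesture intervals), the per-frame feature representation, and the frame rate are all hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Sequence, Tuple

Frame = Sequence[float]       # hypothetical per-frame skeleton feature vector
Detection = Tuple[int, int]   # (start_frame, end_frame) of a spotted gesture
Detector = Callable[[Sequence[Frame]], List[Detection]]


def count_false_positives(
    detectors: Dict[str, Detector],
    background_sequences: Iterable[Sequence[Frame]],
) -> Dict[str, int]:
    """Count detections per gesture over background-only sequences.

    Every detection counts as a false positive because the background
    sequences are assumed to contain no intentional gestures.
    """
    counts = {name: 0 for name in detectors}
    for sequence in background_sequences:
        for name, detect in detectors.items():
            counts[name] += len(detect(sequence))
    return counts


def seconds_per_false_positive(total_fp: int, total_frames: int, fps: float) -> float:
    """Average time between false positives for the tracked footage."""
    if total_fp == 0:
        return float("inf")
    return (total_frames / fps) / total_fp
```

Comparing two candidate gestures then amounts to comparing their entries in the returned dictionary over the same background data.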

We created a set of proposed gestures that semantically correspond
to each gesture in our prompted gesture set (Figure
3). Instead of left and right swipe, we created Pause Swipe,
a swipe that is preceded by a short pause; this preserves the
swipe’s directional property. Instead of point, we created Circle,
a single circular motion of the extended arm
parallel to the torso, at least 30 cm in radius; this preserves
the point gesture’s ability to indicate an object by circling
around it, as if with a cursor. Instead of wave, we created
Vertical Circling, a continuous circling motion in the horizontal
plane with the arm extended upwards from the elbow; this
preserves the periodic property of wave, providing a gesture
that could be performed until a system response is given. Instead
of AirTap, we implemented Forward Up, a push forward
towards the interface, then an upward flick. This preserves
AirTap’s deictic sense that a specific location on the surface
is being activated or approved, similar to a click.

We trained our gesture recognizer on 10 examples of each of
these proposed gestures. We ran the same GSN HMM recognizer
over the dataset to look for these gestures, and
consistently found fewer false positives. For Pause Swipe,
we found 2,494 false positives (15.2 times fewer than Swipe);
for Circle, we found 5,409 false positives (3.5 times fewer than
Point); for Vertical Circling, we found 5,172 false positives
(3 times fewer than Wave); and for Forward Up, we found 268
false positives (3.3 times fewer than AirTap). Overall, we reduced
the false positive rate by a factor of 5.5.
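
The reduction factors follow directly from the reported counts. The short sketch below reproduces the arithmetic; the variable and function names are ours for illustration only.

```python
# False positive counts reported for the original and proposed gestures.
original = {"Swipe": 38005, "Point": 19120, "Wave": 15716, "AirTap": 888}
proposed = {"Pause Swipe": 2494, "Circle": 5409,
            "Vertical Circling": 5172, "Forward Up": 268}

# Which original gesture each proposed gesture replaces.
replaces = {"Pause Swipe": "Swipe", "Circle": "Point",
            "Vertical Circling": "Wave", "Forward Up": "AirTap"}


def reduction_factors():
    """Ratio of original to proposed false positives, per gesture pair."""
    return {new: original[old] / proposed[new] for new, old in replaces.items()}


def overall_reduction():
    """Ratio of total original to total proposed false positives."""
    return sum(original.values()) / sum(proposed.values())


if __name__ == "__main__":
    for name, factor in reduction_factors().items():
        print(f"{name}: {factor:.1f}x fewer false positives")
    print(f"Overall: {overall_reduction():.1f}x")
```

The totals are 73,729 false positives for the original set and 13,343 for the proposed set, which gives the overall factor of 5.5.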

We have successfully produced gestures that are not difficult 
to perform, yet are far less common in background activity. 
While we have only created and tested a single alternative to
each original gesture here, this methodology could be fused 
with other approaches, such as implicit clutching. 

QUALITATIVE OBSERVATIONS 
Body Postures 

A corpus of background data can be used to classify natural 
postures in a given setting. Here, our goal is to classify body 
postures that occur in a comfortable environment like the liv¬ 
ing room. These can be individual postures or combined to 
include multiple bodies. Our results are relevant to under¬ 
standing the availability of a person’s specific body parts to 
provide explicit input for a computer system, which could 
aid in off-line gesture design, as well as in choosing which type of controls
the system offers in the moment. It is also possible that this
could motivate a model of typical movements, given a certain 
body posture - this would allow a system to better distinguish 
unusual movement (a candidate for foreground activity) from 
background activity. In addition, this provides motivation for 
improving body and skeletal tracking for this kind of environ¬ 
ment. 


To find static postures, we used a script to extract depth and 
RGB frames from the data where the depth frames had inter- 
frame differences below a threshold for five seconds or more. 
This resulted in 2,014 frames from the two scenes (couch and
chairs). The frame samples are reasonably uniform across 
studies, with a median of 51 samples for the two scenes across 
13 groups. Using these frames, we classified postures accord¬ 
ing to two characteristics: torso lean and arm position. We 
also observed interesting multi-person body postures. 
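
The static-posture search can be sketched as below. This is an illustrative approximation rather than our extraction script: the mean absolute difference metric, the flat list representation of a depth frame, and all parameter names are assumptions.

```python
from typing import List, Sequence, Tuple


def mean_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean absolute per-pixel difference between two non-empty depth frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def find_static_runs(
    frames: Sequence[Sequence[float]],
    fps: float,
    diff_threshold: float,
    min_seconds: float = 5.0,
) -> List[Tuple[int, int]]:
    """Return (first_index, last_index) of runs with little inter-frame change.

    A run ends when the difference between consecutive frames reaches
    `diff_threshold`. A run of n frames is treated as lasting n / fps
    seconds and is kept only if that is at least `min_seconds`.
    """
    min_frames = int(min_seconds * fps)
    runs = []
    run_start = 0
    for i in range(1, len(frames) + 1):
        at_end = i == len(frames)
        if at_end or mean_abs_diff(frames[i - 1], frames[i]) >= diff_threshold:
            if i - run_start >= min_frames:
                runs.append((run_start, i - 1))
            run_start = i
    return runs
```

One representative depth frame and its matching RGB frame can then be sampled from each run for manual classification.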



Figure 4. Torso lean degrees: (a,b) backward lean (least active); (c) 
neutral lean; (d) forward lean (most active). 


Torso Lean 

We found that the degree of torso lean is a useful way to 
gauge how available someone is for performing explicit input.
We categorized leans into three levels, in decreasing
order of availability: forward, neutral, and backward (Figure 4).

A forward lean is when the head and shoulders are in front 
of the hips; arms have less contact with furniture, and atten¬ 
tion focus is forward. This often resulted from handling food, 
mobile devices, or the Xbox controller. 

A neutral lean is when the torso is near vertical; arms are on armrests,
with one arm often supporting the head. In this case, one
arm typically remains available for interaction. 

A backward lean is characterized by the body appearing re¬ 
laxed, with the torso fully supported by the backrest, of¬ 
ten adopting asymmetrical poses with crossed arms and legs. 
This is the least probable torso lean for interaction. 

Arms 

We observed a variety of different arm postures, ranging from 
extended arms far away from the torso, to crossed arms, and 
arms kept close to the body. Body symmetry is indicative of 
which limbs are available for performing explicit input mo¬ 
tions. Any limb supporting the body, head, or other objects 
is unavailable for immediate explicit input. Even when resting,
relaxed, extended arms aimed towards the system were
indicative of availability (Figure 5).

Combined Body Postures 

We observed combined body postures where two people sat 
close. This happened when sharing food, viewing another 
person’s mobile device, and expressing intimacy. In these
cases, the effectiveness of skeletal and gesture recognizers was
very low. Gesture designers could specifically consider close
postures, for example, designing two-person gestures (Fig¬ 
ure 6). 

Figure 5. (a) - (d) Examples of arm unavailability: (b) Participant gesturing with the available hand. Note, in the RGB overlay, the other hand is
occupied with a bag of chips. 



Figure 6. Examples of combined body postures: (a) pressing torsos together; (b) interweaving legs; (c) sharing food. 


Qualitative Evaluation: Body and Skeleton Tracking 

We used our dataset to evaluate Kinect SDK tracking. We 
found that the tracker performs well when people sit upright 
and make large movements, but performs poorly when people 
are seated with legs crossed, leaning, touching other people, 
or holding objects. To investigate methodically, we reviewed 
the 140 prompted gesture sequences. 

We found 62 (44%) of these sequences have properly tracked 
skeletons. Due to issues with low or uneven depth frame rates 
or lack of skeleton recognizer output, 41 sequences (29%) 
have no skeletal data. However, the depth data quality in 33 of 
these sequences should be adequate for post-capture skeletal 
detection using other libraries. 

The remaining 37 (26%) of the sequences represent inter¬ 
esting failure cases. In five sequences (4%), the participant 
was sitting in a position that makes skeleton detection diffi¬ 
cult, such as having their legs crossed or arms folded tightly 
(see body posture observations above). In 15 cases (11%), 
the skeleton was generally correct, but another object was 
erroneously tracked as the dominant hand (often the partic¬ 
ipant’s torso, leg, or parts of the furniture). This failure was 
likely due to the arms being held close to the body or hands 
occupying a small area when extended directly towards the 
camera. In 11 cases (8%), a skeleton was detected away 
from the two primary participants in the scene, such as on 
some of the items in front of the participants, or another 
participant leaning into frame. Since the Microsoft Kinect
SDK supports a maximum of two skeletons simultaneously,
the addition of this new skeleton resulted in an inability to
track the participant performing the prompted gesture. For
six cases (4%), person-tracking merged two people sitting
close together, creating aberrant skeletons. This was most
pronounced in one session where a couple sat close together 
on the couch. Two of the sessions without prompted gestures 
also have sequences where body tracking merges people sit¬ 
ting close together. Identifying and correcting these failure 
cases has the potential to improve tracking. 
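
The breakdown of the 140 prompted gesture sequences can be checked with a short tally. The category labels below are paraphrased for illustration; the counts are those reported above.

```python
TOTAL_SEQUENCES = 140

# Top-level outcomes for the prompted gesture sequences.
outcomes = {
    "properly tracked": 62,
    "no skeletal data": 41,
    "interesting failures": 37,
}

# Causes within the "interesting failures" group.
failure_causes = {
    "difficult sitting position": 5,
    "other object tracked as dominant hand": 15,
    "skeleton detected away from participants": 11,
    "two people merged": 6,
}


def percent_of_total(count: int, total: int = TOTAL_SEQUENCES) -> int:
    """Share of all prompted sequences, rounded to a whole percent."""
    return round(100 * count / total)


if __name__ == "__main__":
    for label, count in {**outcomes, **failure_causes}.items():
        print(f"{label}: {count} ({percent_of_total(count)}%)")
```

The three top-level outcomes account for all 140 sequences, and the four failure causes account for all 37 failure cases.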





Figure 7. Proposed gesture-specific spatial zones visualized using av¬ 
erage depth occupancy: (a) background sequences; (b) AirTap gesture 
sequences; (c) subtraction revealing spatial gesture zone. 

Gesture-Specific Spatial Zones

We observed participants performing gestures at greater distances
from their body than typical background motions. To
operationalize this, we calculated the average body depth occupancy
during background sequences (Figure 7a) and the average body
depth occupancy during prompted gesture sequences for each
type of gesture (for AirTap, Figure 7b). Subtracting the average
background occupancy from the average gesture occupancy
reveals a spatial zone where that gesture was performed (Figure 7c).
Although the two occupancy maps appear similar, early results indicate
that gestures may populate spaces not common to background activity.
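
The subtraction can be sketched as follows. This is a simplified illustration, assuming each depth frame has already been reduced to a binary body mask; the mask representation and function names are hypothetical and do not describe our analysis tools.

```python
from typing import List, Sequence

Mask = List[List[int]]  # 1 where a pixel belongs to a tracked body, else 0


def average_occupancy(masks: Sequence[Mask]) -> List[List[float]]:
    """Fraction of frames in which each pixel is occupied by a body."""
    rows, cols = len(masks[0]), len(masks[0][0])
    totals = [[0.0] * cols for _ in range(rows)]
    for mask in masks:
        for r in range(rows):
            for c in range(cols):
                totals[r][c] += mask[r][c]
    return [[value / len(masks) for value in row] for row in totals]


def gesture_zone(
    gesture_occupancy: List[List[float]],
    background_occupancy: List[List[float]],
) -> List[List[float]]:
    """Occupancy present during gestures but not in background activity.

    Negative differences are clamped to zero so that only the space
    added by the gesture remains.
    """
    return [
        [max(g - b, 0.0) for g, b in zip(g_row, b_row)]
        for g_row, b_row in zip(gesture_occupancy, background_occupancy)
    ]
```

Pixels with high values in the result mark the region a recognizer could monitor for that gesture.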

CONCLUSION AND FUTURE WORK

We described a methodology to capture whole-body background
activity and used it to capture a television-oriented living
room dataset. To demonstrate the utility of this approach,
we used the dataset to redesign a gesture set and substantially
reduced false positives found by a Hidden Markov Model-based
Gesture Spotting Network recognizer. A major novelty
of this dataset is that it interleaves controlled, prompted
foreground activity with long periods of multi-person,
open-ended background activity, making this kind of analysis possible.
Our documentation of this process includes critical aspects
that would be necessary in future work, including social
considerations, ways to prime activity, and the effect of fur¬ 
niture placement. 

These practical findings are encouraging, but it is important 
to note that our living room dataset and example dataset ap¬ 
plications are primarily intended to illustrate and validate our 
reusable capture methodology. In particular, the large amount 
of rich data recorded, containing a variety of realistic tasks, 
could be used to further explore implicit clutching, natural 
poses, social interaction, etc. The living room dataset and 
supporting capture and analysis tools are made available to 
the research community. 

Our primary contribution is to call attention to back¬ 
ground activity, which has been under-studied and under¬ 
acknowledged in whole-body gestural interfaces appearing in 
the research community. While it is often not feasible to ex¬ 
plore background activity at the very early stages of interac¬ 
tion technique development, it is an important second step 
to fully understand this new interaction paradigm. It would 
be ideal if there were a context-independent set of motions
characterizing all background activity. While there may be 
some commonalities, our data collection was only in a liv¬ 
ing room context. This is arguably a critical context to study 
background activity given many home entertainment applica¬ 
tions, but making any claim of generalising background ac¬ 
tivity across contexts is premature. 

Our intention is that the methods, tools, and techniques presented
here will assist in the research and design of whole-body
gestural interactive systems by motivating the capture and 
sharing of many background activity datasets. In addition, 
our work provides encouraging results for the design of new 
always-on gestures. This supports our argument that under¬ 
standing background activity is crucial to bringing always- 
available whole-body input into the real world. 